Scripting, RMarkdown, & Git
September 2015
Slides based on Ben Marwick's presentation to the UW Center for Statistics and Social Sciences (12 March 2014) (OrcID)
"The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified." Max Kuhn, CRAN Task View: Reproducible Research
Gavish & Donoho AAAS 2011, Oxberry 2013
"An article about computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result." Claerbout and Karrenbach, Proceedings of the 62nd Annual International Meeting of the Society of Exploration Geophysicists. 1992
Technical
Cultural & personal
Peng 2011, Science 334(6060) pp. 1226-1227
A coding error in their Excel spreadsheet sliced several countries out of the data set…. The Economist
- Original survey data not made available for independent reproduction of results.
- Survey incentives misrepresented.
- Sponsorship statement false.
Science Editor-in-Chief Marcia McNutt
Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.
The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness. Authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates
Authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.
"Lower levels of CSF IL-6 were associated with current depression and with future depression […]" Original conclusion
"Higher levels of CSF IL-6 and IL-8 were associated with current depression […]" Revised conclusion
"Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do." Donald E. Knuth, Literate Programming, 1984
For example… Let's calculate the current time in R.
time <- format(Sys.time(), "%a %d %b %X %Y")
The text and R code are interwoven in the output:
The time is `r time`
The time is Mon 21 Sep 15:05:21 2015
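The weaving step can be sketched in a few lines, assuming the knitr package is installed. The variable names (`fence`, `src`, `answer`) are made up for illustration; `knit(text = ...)` takes the source as a character vector and returns the woven markdown as a string.

```r
library(knitr)

# Build a tiny R Markdown source in memory; `fence` is three backticks
fence <- strrep("`", 3)
src <- c(
  paste0(fence, "{r}"),
  "answer <- 6 * 7",
  fence,
  "",
  "The answer is `r answer`."
)

# knit() runs the chunk, then replaces the inline expression with its value
out <- knit(text = src, quiet = TRUE)
cat(out)
```

The chunk is echoed as code in the output, and the inline expression is replaced by the computed value in the sentence that follows it.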
For
Against
The machine-readable part: R
"both a container for the different elements that make up the document and its computations (i.e. text, code, data, etc.), and as a means for distributing, managing and updating the collection… allow us to move from an era of advertisement to one where our scholarship itself is published" Gentleman and Temple Lang 2004
Markdown: lightweight document formatting syntax. Easy to write, read and publish as-is.
The human-readable part
rmarkdown:
- minor extensions to allow R code display and execution
- embed images in html files (convenient for sharing)
- equations
knitr - descendant of Sweave
Engine for dynamic report generation in R
http://kieranhealy.org/blog/archives/2014/01/23/plain-text/
A universal document converter, open source, cross-platform
Payoffs
Costs
- Learning curve
RStudio 'projects' make version control & document preparation simple
Payoffs
- Free space for hosting (and paid options)
- Assignment of persistent DOIs
- Tracking citation metrics
Costs
- Sometimes license restrictions (CC-BY & CC0)
- Limited or no private storage space
Stodden (IASSIST 2010) sampled American academics registered at the Machine Learning conference NIPS (134 responses from 593 requests, 23%). Red = communitarian norms, Blue = private incentives
Reproducible Research Standard (Stodden 2009)
Promote culture change through positive attribution
Implement mechanisms to indicate & encourage degrees of compliance (i.e., clear definitions for different levels of reproducibility), cf. Stodden's:
Demo drawn using materials from Dr. Çetinkaya-Rundel
GEO503 (or similar) Live demo
Cheatsheet:
https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
File -> New File -> RMarkdown -> Document -> HTML
All R code to be run must be in a code chunk like this:
#```{r,eval=F}
CODE HERE
#```
Load these packages in a code chunk:
library(dplyr)
library(ggplot2)
library(spocc)
Do you think you should put install.packages() calls in your script?
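One possible pattern, offered as a sketch rather than the answer: have the script report which packages are missing and leave the installation itself to the user. The helper name `missing_packages` is hypothetical.

```r
# Packages the demo relies on
pkgs <- c("dplyr", "ggplot2", "spocc")

# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # run interactively rather than in the script
}
```

Keeping the install step commented out means that knitting the document never changes the reader's package library without their knowledge.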
Now use the occ() function to download all the occurrence records for the American robin (Turdus migratorius) from the Global Biodiversity Information Facility.
Licensed under CC BY-SA 3.0 via Wikimedia Commons
This can take a few seconds.
## define which species to query
sp='Turdus migratorius'
## run the query and convert to data.frame()
## note: from='ebird' queries eBird; use from='gbif' to query GBIF as described above
d = occ(query=sp, from='ebird',limit = 10000) %>%
  occ2df()
ggplot(d,aes(x=format(date,"%m"),y=latitude,group=1))+ geom_point()+ geom_smooth()
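The x aesthetic in the plot above is the month extracted from each record's date. A minimal base R sketch of that step, using made-up observations and assuming the date column is of class Date:

```r
# Synthetic observations standing in for the downloaded records (made-up values)
obs <- data.frame(
  date = as.Date(c("2015-01-15", "2015-06-01", "2015-12-31")),
  latitude = c(30.2, 45.1, 33.7)
)

# format(date, "%m") returns the month as a zero-padded string ("01" to "12")
obs$month_chr <- format(obs$date, "%m")

# Converting to integer gives a numeric month axis instead of a discrete one
obs$month <- as.integer(obs$month_chr)

obs
```

Because the formatted month is a character vector, ggplot2 treats the axis as discrete. That is why the plot sets group=1, so that geom_smooth() fits one curve across all months.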
Update the YAML header to keep the markdown file
From this:
title: "Untitled"
author: "Adam M. Wilson"
date: "September 21, 2015"
output: html_document
To this:
title: "Demo"
author: "Adam M. Wilson"
date: "September 21, 2015"
output:
  html_document:
    keep_md: true
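To see the effect of keep_md outside the RStudio Knit button, here is a sketch that writes a small file and renders it from the console. The file name `demo.Rmd` and its contents are made up for illustration, and the render step runs only if the rmarkdown package and pandoc are available.

```r
# Write a tiny R Markdown file to a temporary directory (hypothetical file name)
rmd <- file.path(tempdir(), "demo.Rmd")
fence <- strrep("`", 3)
writeLines(c(
  "---",
  'title: "Demo"',
  "output:",
  "  html_document:",
  "    keep_md: true",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), rmd)

# Render only when rmarkdown and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html <- rmarkdown::render(rmd, quiet = TRUE)
  # With keep_md: true, the intermediate demo.md is left next to demo.html
  print(file.exists(sub("\\.html$", ".md", html)))
}
```

The retained .md file is plain text, so it displays directly on GitHub and produces readable diffs under version control.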
Abandoning the habit of secrecy in favor of process transparency and peer review was the crucial step by which alchemy became chemistry. Raymond, E. S., 2004, The Art of UNIX Programming: Addison-Wesley.
See Rpres file on github for full references and sources